In this script we assess whether perturbation indicators are associated with any of the technical covariates. We do this via both joint and marginal analyses. In the joint analysis, we run a logistic regression of the perturbation indicators on all technical covariates. We extract \(p\)-values for each individual covariate, as well as the \(p\)-value for the chi-squared test for the group of all covariates. In the marginal analysis, we run logistic regressions of the perturbation indicators on the technical covariates one by one. All analyses are done on the set of cells receiving a non-targeting gRNA, to match the set of cells used in our undercover analysis.
Most of the technical covariates appear to have no effect on the
propensity of perturbation, with the exception of bio_rep
and lane for the Papalexi data and batch for
the Schraivogel enhancer screen datasets. Let’s take a closer look at
these.
Let’s cross-tabulate biological replicate with NTC in the Papalexi data:
df <- lowmoi::get_data_for_pert_prop("papalexi", "eccite_screen") |>
dplyr::mutate(assigned_grna = factor(assigned_grna, levels = paste0("NTg", c(1:5,7:10))))
df |>
dplyr::select(assigned_grna, bio_rep) |> table()
## bio_rep
## assigned_grna rep_1 rep_2 rep_3
## NTg1 135 111 57
## NTg2 53 27 26
## NTg3 67 65 112
## NTg4 259 100 72
## NTg5 107 105 93
## NTg7 177 157 111
## NTg8 5 23 28
## NTg9 96 77 88
## NTg10 66 90 79
It does look like different NTCs are overrepresented in different
biological replicates (e.g. NTg3 is overrepresented in
rep_3 while NTg4 is overrepresented in
rep_1). Let’s verify this using a chi-squared test of
independence:
chisq.test(df$assigned_grna, df$bio_rep)
##
## Pearson's Chi-squared test
##
## data: df$assigned_grna and df$bio_rep
## X-squared = 178.75, df = 16, p-value < 2.2e-16
So there is indeed a very significant association between biological replicate and assigned gRNA.
Let’s cross-tabulate lane with NTC in the Papalexi data:
df |>
dplyr::select(assigned_grna, lane) |> table()
## lane
## assigned_grna Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7 Lane8
## NTg1 26 31 43 35 39 45 41 43
## NTg2 15 12 12 14 12 11 14 16
## NTg3 17 15 17 18 40 48 50 39
## NTg4 66 54 70 69 30 44 45 53
## NTg5 22 33 27 25 38 44 54 62
## NTg7 36 46 38 57 47 83 76 62
## NTg8 2 2 1 0 10 15 16 10
## NTg9 30 19 22 25 29 48 36 52
## NTg10 17 11 20 18 39 44 40 46
One thing that sticks out is that NTg8 is found in way
fewer cells than the other NTCs. Let’s look more into this:
df |> dplyr::count(assigned_grna)
## assigned_grna n
## 1 NTg1 303
## 2 NTg2 106
## 3 NTg3 244
## 4 NTg4 431
## 5 NTg5 305
## 6 NTg7 445
## 7 NTg8 56
## 8 NTg9 261
## 9 NTg10 235
Indeed, NTg8 is found in only 56 cells, while the other
NTCs are usually found in hundreds of cells. Looking at these counts,
NTg2 looks a little low as well. Let’s return to the
question of association between lane and assigned gRNA. We conduct a
chi-squared test:
chisq.test(df$assigned_grna, df$lane)
##
## Pearson's Chi-squared test
##
## data: df$assigned_grna and df$lane
## X-squared = 175.26, df = 56, p-value = 3.264e-14
We see that the \(p\)-value is quite significant. Let’s exclude the two NTCs with small number of cells to see whether the effect persists:
df_smaller <- df |> dplyr::filter(!(assigned_grna %in% c("NTg2","NTg8")))
chisq.test(df_smaller$assigned_grna, df_smaller$lane)
##
## Pearson's Chi-squared test
##
## data: df_smaller$assigned_grna and df_smaller$lane
## X-squared = 142.77, df = 42, p-value = 6.653e-13
Indeed, the effect persists.
Let’s cross-tabulate batch with NTC in the Schraivogel enhancer screen data for chromosome 8:
df <- lowmoi::get_data_for_pert_prop("schraivogel", "enhancer_screen_chr8") |>
dplyr::mutate(batch = factor(batch, levels = paste0("sample", 1:14)))
df |>
dplyr::select(assigned_grna, batch) |>
table()
## batch
## assigned_grna sample1 sample2 sample3 sample4 sample5 sample6 sample7
## non-targeting-00000 0 0 0 0 0 0 0
## non-targeting-00001 0 0 0 0 0 0 0
## non-targeting-00002 0 0 0 0 0 0 0
## non-targeting-00003 0 0 0 0 0 0 0
## non-targeting-00004 0 0 0 0 0 0 0
## non-targeting-00005 0 0 0 0 0 0 0
## non-targeting-00006 0 0 0 0 0 0 0
## non-targeting-00007 0 0 0 0 0 0 0
## non-targeting-00008 0 1 0 0 0 0 0
## non-targeting-00009 0 0 0 0 0 0 1
## non-targeting-00010 0 0 0 0 0 0 0
## non-targeting-00011 0 0 0 0 0 0 0
## non-targeting-00012 0 0 0 0 0 0 0
## non-targeting-00013 0 0 0 0 0 0 0
## non-targeting-00014 0 0 0 0 0 0 0
## non-targeting-00015 0 0 0 0 0 0 0
## non-targeting-00016 0 0 0 0 0 0 0
## non-targeting-00017 0 0 0 0 0 0 0
## non-targeting-00018 0 1 0 0 0 0 0
## non-targeting-00019 3 0 0 0 0 0 0
## non-targeting-00020 0 0 0 0 0 0 0
## non-targeting-00021 0 0 0 0 1 0 0
## non-targeting-00022 0 1 1 0 0 0 0
## non-targeting-00023 0 0 0 0 0 1 0
## non-targeting-00024 0 0 0 0 0 0 0
## non-targeting-00025 0 0 0 0 0 1 0
## non-targeting-00026 0 0 0 0 0 1 0
## non-targeting-00027 0 0 0 0 0 0 0
## non-targeting-00028 0 0 0 1 0 0 0
## non-targeting-00029 0 1 0 0 1 0 1
## batch
## assigned_grna sample8 sample9 sample10 sample11 sample12 sample13
## non-targeting-00000 0 0 0 0 0 0
## non-targeting-00001 0 0 1 0 0 0
## non-targeting-00002 0 0 0 0 0 0
## non-targeting-00003 0 0 0 0 0 0
## non-targeting-00004 0 0 0 0 0 0
## non-targeting-00005 0 0 0 0 0 0
## non-targeting-00006 0 0 1 0 0 0
## non-targeting-00007 0 0 0 0 1 1
## non-targeting-00008 0 0 0 0 0 0
## non-targeting-00009 0 0 0 0 0 0
## non-targeting-00010 0 0 0 0 0 0
## non-targeting-00011 0 0 0 0 0 0
## non-targeting-00012 0 0 1 0 0 0
## non-targeting-00013 0 0 0 0 0 0
## non-targeting-00014 0 0 0 0 0 0
## non-targeting-00015 0 0 0 0 0 0
## non-targeting-00016 0 0 1 0 0 0
## non-targeting-00017 1 0 0 1 0 1
## non-targeting-00018 0 0 0 0 0 0
## non-targeting-00019 0 0 0 0 1 0
## non-targeting-00020 0 0 0 0 0 0
## non-targeting-00021 0 0 0 0 0 0
## non-targeting-00022 0 0 0 0 0 0
## non-targeting-00023 0 0 0 0 0 0
## non-targeting-00024 0 0 0 0 0 0
## non-targeting-00025 0 1 0 0 0 0
## non-targeting-00026 0 0 0 0 0 0
## non-targeting-00027 0 0 0 0 0 0
## non-targeting-00028 0 0 0 0 0 0
## non-targeting-00029 0 0 0 0 0 1
## batch
## assigned_grna sample14
## non-targeting-00000 58
## non-targeting-00001 72
## non-targeting-00002 67
## non-targeting-00003 71
## non-targeting-00004 44
## non-targeting-00005 58
## non-targeting-00006 43
## non-targeting-00007 52
## non-targeting-00008 26
## non-targeting-00009 37
## non-targeting-00010 28
## non-targeting-00011 45
## non-targeting-00012 81
## non-targeting-00013 33
## non-targeting-00014 56
## non-targeting-00015 60
## non-targeting-00016 79
## non-targeting-00017 56
## non-targeting-00018 99
## non-targeting-00019 32
## non-targeting-00020 42
## non-targeting-00021 107
## non-targeting-00022 54
## non-targeting-00023 30
## non-targeting-00024 28
## non-targeting-00025 141
## non-targeting-00026 40
## non-targeting-00027 55
## non-targeting-00028 120
## non-targeting-00029 64
One thing that jumps out at us is that the vast majority of NTC-containing cells were sequenced in batch 14. Let’s confirm this:
df |> count(batch)
## batch n
## 1 sample1 3
## 2 sample2 4
## 3 sample3 1
## 4 sample4 1
## 5 sample5 2
## 6 sample6 3
## 7 sample7 2
## 8 sample8 1
## 9 sample9 1
## 10 sample10 4
## 11 sample11 1
## 12 sample12 2
## 13 sample13 3
## 14 sample14 1778
Let’s also apply a chi-squared test, calibrated based on permutation:
chisq.test(df$assigned_grna, df$batch, simulate.p.value = TRUE, B = 20000)
##
## Pearson's Chi-squared test with simulated p-value (based on 20000
## replicates)
##
## data: df$assigned_grna and df$batch
## X-squared = 476.86, df = NA, p-value = 0.0145
We see that the \(p\)-value is significant, but much less so that in the Papalexi data. However, note that batch may still have a very strong confounding effect when we test targeting gRNAs against NTCs, since in the population of all cells (rather than the population of just NTCs), there is a very strong association between batch and gRNA.
In the chromosome 11, again batch 14 dominates:
df <- lowmoi::get_data_for_pert_prop("schraivogel", "enhancer_screen_chr11") |>
dplyr::mutate(batch = factor(batch, levels = paste0("sample", 1:14)))
df |> count(batch)
## batch n
## 1 sample1 6
## 2 sample2 3
## 3 sample3 3
## 4 sample4 6
## 5 sample5 3
## 6 sample6 8
## 7 sample7 2
## 8 sample8 2
## 9 sample9 6
## 10 sample10 8
## 11 sample11 2
## 12 sample13 4
## 13 sample14 3026
However, the chi-square test does not come out significant:
chisq.test(df$assigned_grna, df$batch, simulate.p.value = TRUE, B = 20000)
##
## Pearson's Chi-squared test with simulated p-value (based on 20000
## replicates)
##
## data: df$assigned_grna and df$batch
## X-squared = 369.05, df = NA, p-value = 0.2502
It seems that the dominant effect here is not so much that different NTCs are overrepresented in different batches, but that all NTCs are overrepresented in batch 14.